[SOUND]
This lecture is about natural language
content analysis.
Natural language content analysis
is the foundation of text mining.
So we're going to first talk about this.
And in particular,
natural language processing with
a factor how we can present text data.
And this determines what algorithms can
be used to analyze and mine text data.
We're going to take a look at the basic
concepts in natural language first.
And I'm going to explain these concepts
using a similar example
that you've all seen here.
A dog is chasing a boy on the playground.
Now this is a very simple sentence.
When we read such a sentence
we don't have to think
about it to get the meaning of it.
But when a computer has to
understand the sentence,
the computer has to go
through several steps.
First, the computer needs
to know what are the words,
how to segment the words in English.
And this is very easy,
we can just look at the space.
And then the computer will need
the know the categories of these words,
syntactical categories.
So for example, dog is a noun,
chasing's a verb, boy is another noun etc.
And this is called a Lexical analysis.
In particular, tagging these words
with these syntactic categories
is called a part-of-speech tagging.
After that the computer also needs to
figure out the relationship between
these words.
So a and dog would form a noun phrase.
On the playground would be
a prepositional phrase, etc.
And there is certain way for
them to be connected together in order for
them to create meaning.
Some other combinations
may not make sense.
And this is called syntactical parsing, or
syntactical analysis,
parsing of a natural language sentence.
The outcome is a parse tree
that you are seeing here.
That tells us the structure
of the sentence, so
that we know how we can
interpret this sentence.
But this is not semantics yet.
So in order to get the meaning we
would have to map these phrases and
these structures into some real world
antithesis that we have in our mind.
So dog is a concept that we know,
and boy is a concept that we know.
So connecting these phrases
that we know is understanding.
Now for a computer, would have to formally
represent these entities by using symbols.
So dog, d1 means d1 is a dog.
Boy, b1 means b1 refers to a boy etc.
And also represents the chasing
action as a predicate.
So, chasing is a predicate here with
three arguments, d1, b1, and p1.
Which is playground.
So this formal rendition of
the semantics of this sentence.
Once we reach that level of understanding,
we might also make inferences.
For example, if we assume there's a rule
that says if someone's being chased then
the person can get scared, then we
can infer this boy might be scared.
This is the inferred meaning,
based on additional knowledge.
And finally, we might even further infer
what this sentence is requesting,
or why the person who say it in
a sentence, is saying the sentence.
And so, this has to do with
purpose of saying the sentence.
This is called speech act analysis or
pragmatic analysis.
Which first to the use of language.
So, in this case a person saying this
may be reminding another person to
bring back the dog.
So this means when saying a sentence,
the person actually takes an action.
So the action here is to make a request.
Now, this slide clearly shows that
in order to really understand
a sentence there are a lot of
things that a computer has to do.
Now, in general it's very hard for
a computer will do everything,
especially if you would want
it to do everything correctly.
This is very difficult.
Now, the main reason why natural
language processing is very difficult,
it's because it's designed it will
make human communications efficient.
As a result, for example,
with only a lot of common sense knowledge.
Because we assume all of
us have this knowledge,
there's no need to encode this knowledge.
That makes communication efficient.
We also keep a lot of ambiguities,
like, ambiguities of words.
And this is again, because we assume we
have the ability to disambiguate the word.
So, there's no problem with
having the same word to mean
possibly different things
in different context.
Yet for
a computer this would be very difficult
because a computer does not have
the common sense knowledge that we do.
So the computer will be confused indeed.
And this makes it hard for
natural language processing.
Indeed, it makes it very hard for
every step in the slide
that I showed you earlier.
Ambiguity is a main killer.
Meaning that in every step
there are multiple choices,
and the computer would have to
decide whats the right choice and
that decision can be very difficult
as you will see also in a moment.
And in general,
we need common sense reasoning in order
to fully understand the natural language.
And computers today don't yet have that.
That's why it's very hard for
computers to precisely understand
the natural language at this point.
So here are some specific
examples of challenges.
Think about the world-level ambiguity.
A word like design can be a noun or
a verb, so
we've got ambiguous part of speech tag.
Root also has multiple meanings,
it can be of mathematical sense,
like in the square of, or
can be root of a plant.
Syntactic ambiguity refers
to different interpretations
of a sentence in terms structures.
So for example,
natural language processing can
actually be interpreted in two ways.
So one is the ordinary meaning that we
will be getting as we're
talking about this topic.
So, it's processing of natural language.
But there's is also another
possible interpretation
which is to say language
processing is natural.
Now we don't generally have this problem,
but imagine for the computer to determine
the structure, the computer would have
to make a choice between the two.
Another classic example is a man
saw a boy with a telescope.
And this ambiguity lies in
the question who had the telescope?
This is called a prepositional
phrase attachment ambiguity.
Meaning where to attach this
prepositional phrase with the telescope.
Should it modify the boy?
Or should it be modifying, saw, the verb.
Another problem is anaphora resolution.
In John persuaded Bill to buy a TV for
himself.
Does himself refer to John or Bill?
Presupposition is another difficulty.
He has quit smoking implies
that he smoked before, and
we need to have such a knowledge in
order to understand the languages.
Because of these problems, the state
of the art natural language processing
techniques can not do anything perfectly.
Even for
the simplest part of speech tagging,
we still can not solve the whole problem.
The accuracy that are listed here,
which is about 97%,
was just taken from some studies earlier.
And these studies obviously have to
be using particular data sets so
the numbers here are not
really meaningful if you
take it out of the context of the data
set that are used for evaluation.
But I show these numbers mainly to give
you some sense about the accuracy,
or how well we can do things like this.
It doesn't mean any data set
accuracy would be precisely 97%.
But, in general, we can do parsing speech
tagging fairly well although not perfect.
Parsing would be more difficult, but for
partial parsing, meaning to get some
phrases correct, we can probably
achieve 90% or better accuracy.
But to get the complete parse tree
correctly is still very, very difficult.
For semantic analysis, we can also do
some aspects of semantic analysis,
particularly, extraction of entities and
relations.
For example, recognizing this is
the person, that's a location, and
this person and
that person met in some place etc.
We can also do word sense to some extent.
The occurrence of root in this sentence
refers to the mathematical sense etc.
Sentiment analysis is another aspect
of semantic analysis that we can do.
That means we can tag the senses
as generally positive when
it's talking about the product or
talking about the person.
Inference, however, is very hard,
and we generally cannot do that for
any big domain and if it's only
feasible for a very limited domain.
And that's a generally difficult
problem in artificial intelligence.
Speech act analysis is
also very difficult and
we can only do this probably for
very specialized cases.
And with a lot of help from humans
to annotate enough data for
the computers to learn from.
So the slide also shows that
computers are far from being able to
understand natural language precisely.
And that also explains why the text
mining problem is difficult.
Because we cannot rely on
mechanical approaches or
computational methods to
understand the language precisely.
Therefore, we have to use
whatever we have today.
A particular statistical machine learning
method of statistical analysis methods
to try to get as much meaning
out from the text as possible.
And, later you will see
that there are actually
many such algorithms
that can indeed extract
interesting model from text even though
we cannot really fully understand it.
Meaning of all the natural
language sentences precisely.
[MUSIC]

